
Learned Routing + Two-Pass N-gram Rescoring + Extended Orders (2-12)#860

Open
pappanick wants to merge 3 commits into openai:main from pappanick:submission/learned-twopass-ngram

Conversation


@pappanick pappanick commented Mar 26, 2026

Summary

Combines techniques from PRs #834, #846, #733, and #693 into a single submission with 9 innovations.

Techniques

| # | Technique | Source | Expected Δ BPB |
|---|-----------|--------|----------------|
| 1 | Learned routing head `Linear(512, 12)` | PR #834 | base |
| 2 | Two-pass cold-cache rescoring (15 chunks) | PR #846 | -0.01 to -0.03 |
| 3 | Extended n-gram orders 2-12 (8M-bucket hash) | Novel | -0.005 |
| 4 | Gated attention (per-head learned gate) | PR #733/#638 | -0.002 |
| 5 | Value residual learning (λ_v · x0 shortcut) | PR #733/#657 | -0.002 |
| 6 | Depth recurrence (layers 4, 5 repeated → 13 virtual) | PR #733/#686 | -0.006 |
| 7 | SGD TTT (lr=0.002, all blocks unfrozen) | PR #733 | faster, less memory |
| 8 | CROWN-Q quantization regularizer | PR #693 | better int5 quality |
| 9 | Per-order adaptive min_count thresholds | Novel | better sparse n-grams |
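Techniques 3 and 9 can be sketched together as follows. This is an illustrative reconstruction, not the PR's code: `ngram_bucket` and `min_count_for_order` are hypothetical names, and the hash and threshold formulas are stand-ins for whatever the submission actually uses.

```python
# Sketch: extended n-gram orders 2-12 hashed into a fixed 8M-bucket table,
# with a per-order min_count threshold (longer n-grams are sparser, so they
# need a lower count to be trusted). All names and formulas are illustrative.
from collections import defaultdict

NGRAM_BUCKETS = 8_388_608  # matches NGRAM_BUCKETS=8388608 in the run command

def ngram_bucket(tokens, order):
    """Hash the trailing n-gram of the given order into the bucket range."""
    h = order  # salt with the order so identical token runs don't collide
    for t in tokens[-order:]:
        h = (h * 1000003 + t) & 0xFFFFFFFFFFFF
    return h % NGRAM_BUCKETS

def min_count_for_order(order, base=2):
    """Adaptive threshold: high orders accept even singleton observations."""
    return max(1, base * (8 - order)) if order < 8 else 1

# Toy usage: count one context under every order 2-12.
counts = defaultdict(int)
context = [17, 4, 99, 23, 5, 81, 7, 42, 11, 3, 88, 61]
for order in range(2, 13):
    counts[(order, ngram_bucket(context, order))] += 1
```

Keying counts by `(order, bucket)` keeps the eleven orders from sharing buckets even when the raw hashes collide.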

Architecture

PR #834/414 stack: 11 physical layers (13 virtual via depth recurrence), 512d, 8H, 8KV, LeakyReLU(0.5)^2, U-Net skips, SmearGate, BigramHash(6144), Partial RoPE (16/64), XSA all layers, VE128 on layers 9-10, EMA+SWA, GPTQ int5 + zstd-22.

Key Innovation: Depth Recurrence without Bank Refactor

Instead of PR #733's parameter bank approach, we use shared module references: repeat blocks share CastedLinear weights from physical blocks but own independent scalar params (attn_scale, mlp_scale, attn_gate, lambda_v). Before TTT, untie_recurrence() deep-copies the heavy weights so repeat layers can specialize independently. ~1% param overhead during training, full independence during TTT.
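The sharing-then-untying scheme can be shown in miniature. This is a plain-Python sketch (no torch) of the mechanism described above; `Block`, `make_repeat`, and the field names are stand-ins, and the real `untie_recurrence()` would deep-copy `CastedLinear` modules rather than lists:

```python
# Sketch: repeat blocks alias the physical block's heavy weights by reference
# (near-zero extra parameters) while owning their own scalars; before TTT,
# untie_recurrence() deep-copies the weights so each layer can specialize.
import copy

class Block:
    def __init__(self):
        self.weight = [1.0, 2.0]   # stands in for heavy CastedLinear weights
        self.attn_scale = 1.0      # independent per-block scalar params
        self.lambda_v = 0.0

def make_repeat(physical):
    rep = Block()
    rep.weight = physical.weight   # shared reference, not a copy
    rep.attn_scale = 0.5           # scalars are independent from the start
    return rep

def untie_recurrence(physical, repeat):
    # Deep-copy the heavy weights: values identical, storage independent.
    repeat.weight = copy.deepcopy(physical.weight)

phys = Block()
rep = make_repeat(phys)
assert rep.weight is phys.weight       # tied during training
untie_recurrence(phys, rep)
assert rep.weight is not phys.weight   # independent during TTT
```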

Two-Pass Rescoring

Pass 1: Standard sequential chunk eval with causal n-gram cache building.
Pass 2: Rescore first 15 chunks with the full cache (no updates). Early chunks improve dramatically since their n-gram experts now have full context. Per-chunk loss tracking enables precise delta computation.
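The two passes can be sketched as below. `evaluate_chunk` and `update_cache` here are toy stand-ins for the real chunk evaluation and causal cache update, not the PR's functions; only the control flow (build cache in pass 1, rescore early chunks against the frozen full cache in pass 2, track per-chunk deltas) mirrors the description above.

```python
# Sketch: two-pass eval with per-chunk loss tracking. Toy "loss" is just the
# fraction of a chunk's tokens not yet seen in the cache.
def evaluate_chunk(chunk, cache):
    hits = sum(1 for t in chunk if t in cache)
    return 1.0 - hits / len(chunk)

def update_cache(cache, chunk):
    for t in chunk:
        cache[t] = cache.get(t, 0) + 1

def two_pass_eval(chunks, rescore_chunks=15):
    cache = {}
    pass1_losses = []
    for chunk in chunks:                     # pass 1: sequential + causal
        pass1_losses.append(evaluate_chunk(chunk, cache))
        update_cache(cache, chunk)           # cache grows as we go
    pass2_losses = list(pass1_losses)
    for i, chunk in enumerate(chunks[:rescore_chunks]):
        pass2_losses[i] = evaluate_chunk(chunk, cache)  # full cache, no updates
    deltas = [p2 - p1 for p1, p2 in zip(pass1_losses, pass2_losses)]
    return pass2_losses, deltas
```

Negative deltas on the early chunks are exactly the "cold cache" effect the PR targets: chunk 0 is scored with an empty cache in pass 1 but the full cache in pass 2.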

Status

- [x] Code compiles, all syntax checks pass
- [x] Full pipeline verified on MPS (Apple Silicon)
- [x] Depth recurrence: weight sharing + untie tested
- [x] N-gram hash collision test for all 11 orders
- [x] Two-pass loss delta computation verified
- [ ] Full 8xH100 training + eval run (need compute credits)
- [ ] 3-seed validation

Run Command

```shell
RECUR_LAYERS="4,5" GATED_ATTENTION=1 VALUE_RESIDUAL=1 \
TWO_PASS_ENABLED=1 TWO_PASS_RESCORE_CHUNKS=15 \
NGRAM_MAX_ORDER=12 NGRAM_BUCKETS=8388608 \
TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_FREEZE_BLOCKS=11 \
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=6144 XSA_LAST_N=11 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

…ders

Combines PR openai#834's learned multi-expert routing head with PR openai#846's
two-pass cold-cache rescoring. Key changes:

- Extended n-gram orders from 2-7 to 2-12 with 8M bucket hash tables
- Two-pass eval: rescore first 15 chunks with full cache after pass 1
- Per-chunk loss tracking for precise pass-1/pass-2 delta computation
- Configurable via env vars: NGRAM_MAX_ORDER, NGRAM_BUCKETS,
  TWO_PASS_ENABLED, TWO_PASS_RESCORE_CHUNKS

Based on PR openai#834 (AnirudhRahul) + PR openai#846 (himanshudongre) stack.
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- Lambda_v * x0 shortcut from initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate, lambda_v to control tensor patterns for proper quantization handling
- All smoke tests pass on CPU
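The two additions in this commit can be sketched numerically. This is an illustrative reconstruction: the function and argument names are hypothetical, scalars stand in for learned tensors, and the real code operates on per-head attention tensors rather than lists.

```python
# Sketch: per-head gated attention output plus a lambda_v * x0 value-residual
# shortcut from the initial embedding. Scalars stand in for learned params.
def attention_with_gate_and_value_residual(attn_out_per_head, attn_gate,
                                           v, x0, lambda_v):
    # Gated attention: each head's output is scaled by its learned gate
    # before the heads are merged.
    gated = [g * h for g, h in zip(attn_gate, attn_out_per_head)]
    # Value residual: mix current values with the layer-0 embedding x0.
    v_mixed = [(1 - lambda_v) * vi + lambda_v * x0i
               for vi, x0i in zip(v, x0)]
    return gated, v_mixed

# Toy usage with two heads and a one-dimensional value stream.
gated, v_mixed = attention_with_gate_and_value_residual(
    attn_out_per_head=[2.0, 4.0], attn_gate=[0.5, 1.0],
    v=[1.0], x0=[3.0], lambda_v=0.5)
```

Treating `attn_gate` and `lambda_v` as plain learnable scalars is also why the commit adds them to the control-tensor patterns: they must be excluded from weight quantization.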
…eader

Major additions:
- Depth recurrence: layers 4,5 repeated -> 13 virtual from 11 physical
  Repeat blocks share heavy CastedLinear weights, own scalar params
  untie_recurrence() deep-copies before TTT for independent specialization
  Only ~1% param overhead during training
- TTT defaults changed to match PR openai#733 winning recipe:
  - SGD optimizer (was AdamW) - simpler, less memory
  - lr=0.002 (was 0.0005) - higher for SGD
  - Unfreeze all 11 blocks (was 2) - more params for adaptation
- All repeat_blocks params unfrozen for TTT

Configurable via: RECUR_LAYERS="4,5" TTT_OPTIMIZER=sgd TTT_LR=0.002

All smoke tests pass on CPU (syntax, recurrence, weight sharing, untie).
